Conditional Recurrent Neural Networks

An Application to ACKR3 Ligand Discovery

Mark James Thompson (ACLS)

2025-07-22

Aim of the Project

  • Problem Statement: The goal is to find novel chemical entities that will bind to a receptor, using a generative deep-learning model.

  • BUT there are about 10^60 possible clinical candidates! (< 500 Da and meeting Lipinski’s rule of five)

  • Traditional approach: Search for properties via databases and test.

  • Generative deep-learning approach: Train on known entities and their properties, and predict which ones might work
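As a sketch of the filter implied by these constraints, here is a minimal Lipinski rule-of-five check (pure Python; in practice the descriptor values would come from RDKit, and `passes_lipinski` is a name chosen here for illustration):

```python
def passes_lipinski(mol_weight, logp, h_bond_donors, h_bond_acceptors):
    """Lipinski's rule of five: a coarse drug-likeness filter that
    bounds the ~10^60 chemical space to plausible oral candidates."""
    return (mol_weight < 500            # Daltons
            and logp <= 5               # octanol-water partition coefficient
            and h_bond_donors <= 5
            and h_bond_acceptors <= 10)
```

For example, a 350 Da molecule with logP 2.1, 2 donors, and 5 acceptors passes, while a 600 Da molecule with logP 6.2 does not.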

The Target: ACKR3 – a G Protein-Coupled Receptor (GPCR)

  • Motivation: ACKR3 plays a role in:
    • angiogenesis (development of new blood vessels);
    • tumor growth and metastatic processes;
    • neurological conditions; etc.

Source (Yen 2022)

RNN Extensions:

  • Olivecrona (2017): gradient clipping

  • Preuer (2018): 2 CNN layers pre-pended

  • Kotsias (2020): Conditional with pre-pended QSAR properties

  • Xu (2021): Conditional Coulomb matrix of the pocket

–> I use a conditional RNN to generate the molecules

Data: Sources

  • ACKR3 ligands:
    • Vrije Universiteit NL (Riemens 2023);
    • InterAx; and,
    • papers on ACKR3.
  • ZINC dataset: known clinical molecules (10’000s of entities)
  • ChEMBL: Ligands known to work with GPCR protein receptors

Data: Aggregation

KNIME Workflow

Data: Relevant Features (RDKit/gnina)

Metric
ACKR3 binding affinity
Synthetic accessibility score
Molecular weight
Water−octanol partition coefficient
Number of hydrogen bond acceptors
Number of hydrogen bond donors
Number of aromatic rings

Data

Distribution of mass (Da) amongst different classes of molecules

Data: Mild Correlation between Binding and pCHEMBL

KNIME reports a modest but statistically significant correlation of 0.236 between the gnina docking score and pChEMBL (p < 0.001 on 575 degrees of freedom).
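The significance test behind that figure can be sketched as follows (NumPy; the t statistic on n − 2 degrees of freedom is the standard test for a Pearson correlation, and the function name is illustrative — the real computation was done in KNIME):

```python
import numpy as np

def pearson_with_t(x, y):
    """Pearson correlation r plus the t statistic used to test its
    significance on n - 2 degrees of freedom."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]        # sample Pearson correlation
    n = len(x)
    t = r * np.sqrt((n - 2) / (1.0 - r * r))
    return r, t, n - 2
```

With r = 0.236 and 575 degrees of freedom, t ≈ 0.236·√(575/(1 − 0.236²)) ≈ 5.8, far beyond the ~1.96 two-sided 5% threshold, hence the reported p < 0.001.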

Example of Inverse Agonist

Example of ACT-1004, an inverse agonist

Model: Trials and Tribulations

  • Iterative Process:
    • VAE: sparse graph-information encoding into the network, and feature handling proved error prone

    • GAN: unstable training, plus hardware issues during optimization

    • –> The RNN tuned the best out of the box, and gave SMILES relatively similar to the training set without any tuning.

Model Architecture: RNN simple, yet powerful

  • Chemistry can be fairly well expressed in terms of a sequence of elements and radicals

  • Given the textual nature of a SMILES string, storage and memory requirements are small

  • Simple logic, works on minor hardware (MacPro w/ nVidia GPU)

  • Clinical properties integrated with conditional RNN architecture

Model: Schema

Model: Conditional RNN Formulation

Both the output and hidden processes are conditioned on the chemical properties of the molecule.

  • Hidden process \[ h_t = f(W_{xh}x_t + W_{hh} h_{t-1} + \mathbf{W}_{ch}\mathbf{c} + b_h) \]

  • Output process \[ y_t = g(W_{hy} h_t +\mathbf{W}_{cy}\mathbf{c}+b_y) \]
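A minimal NumPy sketch of one step of this recurrence (tanh and softmax are assumed for f and g, and the weight names mirror the equations; dimensions are illustrative, not the trained model's):

```python
import numpy as np

def cond_rnn_step(x_t, h_prev, c, W_xh, W_hh, W_ch, b_h, W_hy, W_cy, b_y):
    """One step of the conditional RNN: the property vector c enters
    both the hidden update (via W_ch) and the output (via W_cy)."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + W_ch @ c + b_h)  # hidden process
    logits = W_hy @ h_t + W_cy @ c + b_y                        # output process
    p = np.exp(logits - logits.max())                           # stable softmax
    return h_t, p / p.sum()    # next hidden state, token distribution
```

Sampling a SMILES string then amounts to iterating this step, feeding each drawn token back in as the next x_t while c stays fixed.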

Model: Core Model Formulation

Layer (type)                Output Shape                                Param #
Tokens (Embedding)          (256, 400, 128)                             4,992
Feature layers 1–7 (Dense)  (256, 16)                                   32
LSTM layers 1–3             ((256, 400, 256), (256, 256), (256, 256))   508,928
Dropout layers 1–3          –                                           20%
Output (Dense)              (256, 400, 39)                              10,023

Model: Loss

  • Batch size restricted due to hardware limitations
  • Optimizer was Adam with a default learning rate of 0.0001
  • Loss function was sparse cross-entropy, which matches the encoding of each character as an integer: \[ L = - \frac{1}{N} \sum_{i=1}^{N} \ln(p_{i,c_i}) \] where \(c_i\) is the true token class at position \(i\) and \(N\) is the number of tokens.
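A sketch of this loss on integer-encoded tokens (NumPy; `probs` rows stand in for the per-position softmax outputs — an illustration of the formula, not the training code itself):

```python
import numpy as np

def sparse_cross_entropy(probs, targets):
    """Mean negative log-likelihood of the true token class at each
    position; probs has shape (N, vocab), targets holds the integer
    class of each of the N tokens."""
    idx = np.arange(len(targets))
    return -np.mean(np.log(probs[idx, targets]))
```

The "sparse" variant takes the targets directly as integers, matching the character-to-integer encoding, instead of requiring one-hot vectors.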

Models: Overview of RNN Models

Paper               Nodes  Layers  Dropout            Sample Size   Notes
Bjerrum (2017a)     256    2       0.1                1.6m, 13.2m   2 more FF NN layers post-pended
Bjerrum (2017b)     64     1       0.0                673 (89k)     (enumerated, cf. infra)
Gupta (2017)        256    2       yes                542k          lengths 34 to 74
Olivecrona (2017)   1024   3       ?                  1.5m          gradient clipping, agent optimizer
Segler (2018)       1024   3       0.2 + d.o. layers  1.4m          gradient clipping
Preuer (2018)       ?      2       ?                  200k          2 CNN + maxpool pre-pended
Polykovskiy (2020)  768    3       0.2                1.76m         CharRNN model
Kotsias (2020)      256    3       ?                  1.34m         6 dense layers pre-pended for properties
Grisoni (2020)      512    2       0.3                272k
Xu (2021)           512    2       0.3                194k          2 dense layers pre-pended for Coulomb matrix

Results: Accuracy

  • Very low out-of-sample validity for some models: up to 97.5% invalid SMILES, depending on the architecture
  • Gains are low after 64+ epochs

Optimization in a general run

Results: Validity

Overview of RNN attributes.
Model Validity (%) Uniqueness (%) Novelty (%) Avg. QED Avg. SA Avg. SlogP Internal Diversity
Real 100.0 100 0.0 0.646 3.07 3.15 0.877
Polykovskiy 64.6 100 100.0 0.772 2.06 3.21 0.798
Bjerrum 72.7 100 100.0 0.782 2.42 3.52 0.838
Gupta 75.0 100 100.0 0.801 2.28 3.60 0.826
Kotsias 2.1 100 100.0 0.061 2.91 6.38 NA
Segler 67.3 100 100.0 0.802 2.16 3.31 0.821
Olivecrona 0.6 100 100.0 0.095 3.38 13.00 0.715
Xu/Grisoni 2.7 100 100.0 0.092 2.93 19.90 0.539
64h512_2m 79.2 100 100.0 0.774 2.26 3.33 0.832
64h512_4m 75.5 100 99.8 0.750 2.31 3.61 0.836
64h512_8m 78.6 100 100.0 0.776 2.25 3.19 0.833
64h1024_4m 77.1 100 100.0 0.790 1.99 3.33 0.789
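The three headline columns of this table can be computed as in the following sketch (pure Python; `is_valid` stands in for a real validity check such as parsing the string with RDKit's `MolFromSmiles`, which is not shown here):

```python
def generation_metrics(generated, training, is_valid):
    """Validity, uniqueness, and novelty (as fractions) of a batch of
    generated SMILES, relative to the training set."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated) if generated else 0.0
    unique = set(valid)                       # de-duplicate valid strings
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(unique - set(training)) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty
```

Note the metrics are conditional: uniqueness is taken over the valid strings only, and novelty over the unique valid strings, which is why a model with 2% validity can still show 100% novelty.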

Results: De Novo Generation Shifts the Distribution Higher

Results: Example DeNovo Molecules

Molecular view of candidate ligand A. Note the benzene ring and the nitrogen that we saw in the known agonist.
Molecular view of candidate ligand B. Note the benzene ring and the nitrogen that we saw in the known agonist.

Results: de Novo Molecules Resemble the Training Distribution

t-SNE

Results: A (un)known SMILES

  • String: “Cc1cc(CN)n(n1)[C@H]1CC@@Hc1cc2cc(C)ccc2n1Cc1ccc(Cl)cc1”
  • Received a gnina CNN ligand-ACKR3 binding score of 0.961

Nicotinamide N-methyltransferase (NNMT)

Boltz-2 Simulation: Baseline

The inverse agonist ACT-1004-1239

Boltz-2 Simulation: Potential Candidate

Candidate inverse agonist A

Boltz-2 Simulation: Bad Candidate

Candidate D penetrates the trans-membrane helices: both the docking pose and the molecule itself are “bad”. A long-chain polymer (orange) threads through the trans-membrane helices. This candidate is likely spurious or would be disruptive to the GPCR.

Discussion: Challenges

  • Low-level hardware issues (galore!): GRU activation function, GPU libraries, etc.
  • Compute resources scarce: the binding scores took 7+ months of compute time to generate! Thousands of files, slow cycle times.
  • Large IT setup and pre-work: Python versioning and environments, undocumented tools
  • Validation is the main open issue: it is hard to confirm that a de novo ligand actually works.

Discussion: Future Directions & Ideas

  • Add topological information about the molecule through graph or structural fingerprints
  • A larger dataset covering all GPCR ligands would be useful, including those that have been docked in a complex.
  • An interesting idea would be to investigate including simulation input via a CNN.

Conclusion

  • Achievements:
    • Scored ~97’000 known candidate ligands for their potential to dock in complex with ACKR3 using gnina’s CNN docking function
    • Used binding score to find ~200 novel candidate ligands for ACKR3 obtaining some therapeutic compound leads
  • Key Takeaway: Demonstrates the potential of conditional RNNs for developing candidate molecules tailored to a specific domain.

Thank You & Questions?

  • Thanks to Tomek at InterAx Biotech for showing me MD and helping me with the biology
  • Thanks Manuel Dömer for putting up with my inner conflict
  • Ready for questions.

Appendix: Feature Summary Statistics

ExactMW Summary Statistic on Scored Data
type N ExactMW_mu ExactMW_sigma ExactMW_min ExactMW_max
GLASS 562 463.0917 89.21046 343.22598 2050.0366
GPCR_ligand 21365 422.7739 122.97633 75.03203 2065.0475
ZINC_molecule 66287 349.0975 49.96812 202.10275 495.2169
deNovo 2762 845.8805 384.62142 159.14919 1255.8751
interax 359 405.6960 71.15851 164.10620 600.3788

Appendix: Feature Summary Statistics

sLogP Summary Statistic on Scored Data
type N SlogP_mu SlogP_sigma SlogP_min SlogP_max
GLASS 562 4.374149 1.2522874 -6.26022 6.8349
GPCR_ligand 21365 4.040989 1.4639035 -9.50710 10.8638
ZINC_molecule 66287 3.038343 0.9987149 -0.13210 4.9280
deNovo 2762 7.728784 6.4347880 -8.60250 34.9649
interax 359 3.870362 0.9736113 0.42854 6.7920

Appendix: Feature Summary Statistics

gnina Binding Score Summary Statistic on Scored Data
type N CNNscore_mu CNNscore_sigma CNNscore_min CNNscore_max
GLASS 562 0.6859925 0.0957217 0.0548102 0.9527524
GPCR_ligand 21365 0.7133912 0.1457480 0.0142750 0.9837449
ZINC_molecule 66287 0.7668422 0.1242323 0.0208747 0.9906685
deNovo 2762 0.7669212 0.2231383 0.0930477 0.9640365
interax 359 0.7679811 0.1044172 0.4975969 0.9684185

Appendix: Feature Summary Statistics

Rotatable Bonds Summary Statistic on Scored Data
type N NumRotatableBonds_mu NumRotatableBonds_sigma NumRotatableBonds_min NumRotatableBonds_max
GLASS 562 6.496441 2.171971 2 45
GPCR_ligand 21365 5.559373 3.396171 0 48
ZINC_molecule 66287 4.684932 1.600661 1 8
deNovo 2762 48.868573 32.237210 0 86
interax 359 5.688022 2.245454 0 13

Appendix: Improvement of the RNN for SMILE chemistry

We experimented with replacing element digraphs and trigraphs with single characters, for example:

  • “As”: “🜺”, # Alchemical Arsenic
  • “Ag”: “☽”, # Alchemical symbol for Silver (Moon)
  • “Na”: “钠”, # Chinese character for Sodium, whose Mandarin name is “nà”
  • “Be”: “铍”, # Chinese character for Beryllium
  • “Bi”: “♆”, # Alchemical Bismuth

Gains were small to nil: as the combinations lengthen, there are fewer of them to replace.
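The substitution can be sketched as a simple reversible mapping (pure Python; naive `str.replace` suffices for these particular tokens, though a bracket-aware SMILES tokenizer would be safer in general):

```python
# Hypothetical singleton map: multi-character element tokens -> one glyph,
# so the character-level RNN sees a single symbol per element.
SINGLETONS = {"As": "🜺", "Ag": "☽", "Na": "钠", "Be": "铍", "Bi": "♆"}

def compress_smiles(smiles):
    """Replace each multi-character element token with its glyph."""
    for multi, single in SINGLETONS.items():
        smiles = smiles.replace(multi, single)
    return smiles

def expand_smiles(smiles):
    """Invert compress_smiles, restoring the original SMILES text."""
    for multi, single in SINGLETONS.items():
        smiles = smiles.replace(single, multi)
    return smiles
```

The mapping round-trips: `expand_smiles(compress_smiles("[Na+].[Cl-]"))` returns the original string.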